Journal of Beijing University of Posts and Telecommunications

  • EI-indexed core journal

Journal of Beijing University of Posts and Telecommunications ›› 2007, Vol. 30 ›› Issue (6): 40-45. doi: 10.13190/jbupt.200706.40.028

• Papers

Collocation Extraction Based on Relative Conditional Entropy

WANG Daliang1, ZHANG Dezheng1, TU Xuyan1, ZHENG Xuefeng1, TONG Zijian2

  1. School of Information Engineering, University of Science and Technology Beijing, Beijing 100083, China
  2. Department of Research and Development, Sohu.com Inc., Beijing 100084, China
  • Received: 2007-02-04 Revised: 2007-03-26 Online: 2007-12-31 Published: 2007-12-31
  • Contact: WANG Daliang

Abstract:

Previous work on collocation extraction treated a collocation as the simple juxtaposition of lexical items and ignored the directional preference between them. To address this problem, this paper proposes a statistical model of collocation preference based on relative conditional entropy, which measures how strongly a headword depends on the words that co-occur with it in context. Linguistic heuristic rules are then added: a part-of-speech filter and a sliding window are used to identify collocation boundaries. Together these form a collocation extraction method for open-corpus settings. The method is highly interpretable and effectively reveals the internal mechanism by which collocations are formed. It is proved that collocation preference strength can be interpreted as mutual information corrected by direction.

Key words: natural language processing, collocation extraction, relative entropy, collocation preference
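The abstract does not give the model's formulas, so the following is only a minimal Python sketch of the general pipeline it describes: a part-of-speech filter, a sliding window, and a score that distinguishes direction. It is not the authors' relative conditional entropy measure. The tag set `CONTENT_TAGS`, the function names, and the toy corpus are all assumptions made for illustration. The sketch reports pointwise mutual information alongside the conditional probability in each direction, which is enough to show why a symmetric score alone cannot capture preference.

```python
import math
from collections import Counter

# Hypothetical tag set (noun, verb, adjective); not taken from the paper.
CONTENT_TAGS = {"n", "v", "a"}


def cooccurrence_counts(sentences, window=2, keep=CONTENT_TAGS):
    """Count ordered (headword, co-word) pairs within +/-window tokens.

    Each sentence is a list of (word, pos_tag) tuples. A pair is skipped
    if either word fails the part-of-speech filter.
    """
    pairs = Counter()
    for sent in sentences:
        for i, (head, head_tag) in enumerate(sent):
            if head_tag not in keep:
                continue
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                word, tag = sent[j]
                if j != i and tag in keep:
                    pairs[(head, word)] += 1
    return pairs


def directional_scores(pairs):
    """Return {(head, word): (pmi, p_word_given_head, p_head_given_word)}."""
    total = sum(pairs.values())
    head_totals = Counter()
    for (head, _), n in pairs.items():
        head_totals[head] += n
    scores = {}
    for (head, word), n in pairs.items():
        # The window is symmetric, so every co-word also occurs as a head
        # and head_totals doubles as the co-word marginal.
        pmi = math.log2(n * total / (head_totals[head] * head_totals[word]))
        scores[(head, word)] = (pmi, n / head_totals[head], n / head_totals[word])
    return scores


# Toy corpus, invented for illustration only.
corpus = [
    [("drink", "v"), ("tea", "n")],
    [("drink", "v"), ("tea", "n")],
    [("drink", "v"), ("water", "n")],
    [("green", "a"), ("tea", "n")],
]
scores = directional_scores(cooccurrence_counts(corpus, window=1))
pmi, p_water_given_drink, p_drink_given_water = scores[("drink", "water")]
print(pmi, p_water_given_drink, p_drink_given_water)
```

In this toy corpus "water" always co-occurs with "drink", while "drink" co-occurs with "water" in only one of its three pairs. The mutual information value is the same for both orderings of the pair, and the two conditional probabilities differ. That asymmetry is the kind of directional information the abstract says a plain co-occurrence view ignores.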

CLC number: